Tool Decathlon
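Tool Decathlon (Toolathlon for short) is a benchmark for language agents that evaluates how well large models use tools to carry out complex tasks in realistic environments. It covers 32 software applications and 604 tools, ranging from everyday tools such as Google Calendar and Notion to professional ones such as WooCommerce, Kubernetes, and BigQuery. It contains 108 tasks, each requiring about 20 tool interactions on average. Released in October 2025, it aims to fill the gaps in existing evaluations around tool diversity and long-horizon execution. Through execution-based evaluation, the benchmark provides reliable performance metrics…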
Updated Apr 25, 2026·765 views
- Problem Count: 108
- Institution: Individual
- Category: AI Agent - Tool Use
- Metrics: Accuracy
- Language: English
- Difficulty: High
Overview
Tool Decathlon is a benchmark for evaluating how well large models use tools to carry out complex tasks in realistic environments.
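The page lists Accuracy as the metric and describes the evaluation as execution-based. The Python sketch below shows one way to compute such a score, assuming each task comes with a checker that inspects the final environment state after the agent finishes. The names `TaskResult`, `evaluate_task`, and `accuracy`, and the example states, are hypothetical illustrations and are not taken from the Toolathlon harness.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TaskResult:
    task_id: str
    passed: bool      # True if the final state satisfied the task's checker
    tool_calls: int   # number of tool interactions the agent used


def evaluate_task(
    task_id: str,
    final_state: Dict[str, Any],
    checker: Callable[[Dict[str, Any]], bool],
    tool_calls: int,
) -> TaskResult:
    """Score one task by checking the environment state, not the agent's text."""
    return TaskResult(task_id, bool(checker(final_state)), tool_calls)


def accuracy(results: List[TaskResult]) -> float:
    """Percentage of tasks whose final state passed its checker."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


# Hypothetical example: two tasks with the same made-up checker.
def calendar_event_exists(state: Dict[str, Any]) -> bool:
    return state.get("event_created", False)


example_results = [
    evaluate_task("task-1", {"event_created": True}, calendar_event_exists, 18),
    evaluate_task("task-2", {"event_created": False}, calendar_event_exists, 22),
]
print(f"accuracy: {accuracy(example_results):.2f}")
```

Scoring the resulting state rather than the agent's final message is what makes this kind of evaluation execution-based: a fluent answer earns no credit unless the environment actually changed as required.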
Related resources
Latest Tool Decathlon model rankings and full benchmark leaderboard
Browse the latest scores, model modes, release dates, and parameter sizes for Tool Decathlon.
Source: DataLearnerAI
Data sourced primarily from official releases (GitHub, Hugging Face, papers), then benchmark leaderboards, then third-party evaluators.
Tool Decathlon Rank
| Rank | Model (Mode) | Score | Release Date | Parameters | License |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 (Thinking Enabled, Tools) | 50.00 | 2026-04-20 | 1000B | Free Commercial |
| 2 | GPT-5.4 mini (Thinking Level · Extra High, Tools) | 42.90 | 2026-03-17 | Unknown | Closed |
| 3 | GLM 5.1 (Thinking Enabled, Tools) | 40.70 | 2026-03-27 | 75.4B | Free Commercial |
| 4 | Qwen 3.6 Plus Preview (Thinking Enabled, Tools) | 39.80 | 2026-03-31 | Unknown | Closed |
| 5 | Qwen3.5-397B-A17B (Thinking Enabled, Tools) | 38.30 | 2026-02-16 | 39.7B | Free Commercial |
| 6 | GPT-5.4 nano (Thinking Level · Extra High, Tools) | 35.50 | 2026-03-17 | Unknown | Closed |
| 7 | Qwen3.6-35B-A3B (Thinking Enabled) | 26.90 | 2026-04-16 | 35B | Free Commercial |



